The Development of the Multilingual LUNA Corpus for Spoken Language System Porting

نویسندگان

  • Evgeny A. Stepanov
  • Giuseppe Riccardi
  • Ali Orkan Bayer
چکیده

The development of annotated corpora is a critical process in the development of speech applications for multiple target languages. While the technology to develop a monolingual speech application has reached satisfactory results (in terms of performance and effort), porting an existing application from a source language to a target language is still a very expensive task. In this paper we address the problem of creating multilingual aligned corpora and its evaluation in the context of a spoken language understanding (SLU) porting task. We discuss the challenges of the manual creation of multilingual corpora, as well as present the algorithms for the creation of multilingual SLU via Statistical Machine Translation (SMT).

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Standoff Coordination for Multi-Tool Annotation in a Dialogue Corpus

The LUNA corpus is a multi-lingual, multidomain spoken dialogue corpus currently under development that will be used to develop a robust natural spoken language understanding toolkit for multilingual dialogue services. The LUNA corpus will be annotated at multiple levels to include annotations of syntactic, semantic, and discourse information; specialized annotation tools will be used for the a...

متن کامل

Design and Data Collection for Spoken Polish Dialogs Database

Spoken corpora provide a critical resource for research, development and evaluation of spoken dialog systems. This paper describes the telephone spoken dialog corpus for Polish created by Polish-Japanese Institute of Information Technology team within the LUNA project (IST 033549). The main goal of this project is to create a robust natural spoken language understanding (SLU) toolkit, which can...

متن کامل

Annotation of Polish spoken dialogs in LUNA project

In this paper we present general assumptions and goals of the LUNA (spoken Language UNderstanding in multilinguAl communication systems) project. We describe the process of collecting a Polish corpus of spoken dialogs and the accepted annotation schema of this corpus at several levels, from transcription of dialogs and morphosyntactic analysis, to semantic and dialog acts annotation.

متن کامل

The Effect of CMC in Business Emails in Lingua Franca: Discourse Features and Misunderstandings

The paper argues that everyday exchange of business emails produces a development in the work-group relationship, which, in turn, makes new communication styles possible and acceptable by the users' habit to computer-mediated forms, even in unbalanced professional exchanges. The focus is on the (spoken) discourse features of email messages in a self-compiled corpus of selected computer-mediated...

متن کامل

پیکره اعلام: یک پیکره استاندارد واحدهای اسمی برای زبان فارسی

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question and answering, machine translation, and document classification systems. A NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014